Abstractions for Fault Tolerance in Distributed Systems

Fred B. Schneider
Cornell University
Ithaca, New York, U.S.A. 14853


Abstractions useful in fault-tolerant and distributed systems are described. The abstractions are specified as properties of protocols, hence they have a different flavor from abstractions prevalent in sequential and concurrent programming. Among the abstractions discussed are agreement, order, failure detection, and stable storage.


1. Introduction

Distributed computing and fault tolerance are closely linked. Fault tolerance can only be achieved by replication of function using components with independent failure modes. In a distributed system, the physical separation and isolation of processors linked by a communications network ensure that processors have independent failure modes. Thus, achieving fault tolerance in a computing system can lead to solving problems traditionally associated with distributed computing systems. Distributed systems, on the other hand, frequently involve replicated function and data for performance reasons. Protocols to manage replication are clearly integral to such systems. Finally, as the size of a distributed system increases, so does the probability that one or more of its components will fail. A large system must continue to function despite some failures, or it will be unusable most of the time. Thus, distributed systems, at least large ones, must be fault tolerant.

One cannot simply employ "separation of concerns" to decompose a problem into its fault-tolerance aspects and its distributed computing aspects. The two concerns cannot be separated because each requires the other in its implementation. Moreover, support for distribution and fault tolerance pervade the lowest levels of a system, so retrofitting the necessary support onto an existing application can result in an unacceptable performance penalty. When constructing a system, the problems associated with fault tolerance and distribution must be confronted together and from the outset.

Only by using abstractions, which capture the important properties of an object of interest and suppress irrelevant details, can one hope to master the complexity associated with supporting fault tolerance and distribution. Of course, to be useful, an abstraction must be implementable. For example, it is easy to implement a fault-tolerant system by assuming fault-free processors, but implementing a fault-free processor is impossible, rendering it a useless abstraction for building real systems.

A virtual circuit [34] is an example of a good abstraction. It is a communications channel that allows processes to exchange messages according to:

VC1: Messages sent by one process to another are delivered uncorrupted.

This work is supported by NSF Grant DCR-8320274 and Office of Naval Research contract N00014-86-K-0092.




VC2:  Messages  sent  by  one  process  to  another  are 
delivered  in  the  order  sent. 

The important properties of the virtual circuit abstraction are given by VC1 and VC2. Irrelevant details concern how VC1 and VC2 are achieved, including the message routing protocol in use and the hardware that connects processors, and how the virtual circuit is used, such as whether asynchronous send or Ada-style rendezvous is supported. And, the virtual circuit abstraction is implementable: the networking world contains numerous implementations.

This paper describes abstractions that are useful in implementing fault-tolerant and distributed systems. The same abstractions serve both fault tolerance and distribution, supporting our belief that the two concerns are not separable. The abstractions we present have a somewhat different flavor from abstractions prevalent in sequential and concurrent programming, which encapsulate state information and/or provide operations to manipulate that state (e.g. a stack or a monitor). Our abstractions are best thought of as properties of protocols or control.

Section 2 describes abstractions of processors that can fail and section 3 reviews some fundamentals for coping with failures. Section 4 discusses the state machine approach, a general way to construct fault-tolerant distributed computing services. The state machine approach motivates two abstractions: Agreement and Order. Section 5 discusses a second approach for constructing a fault-tolerant computing service and this motivates two more abstractions: Failure Detection and Stable Storage. Implementing abstractions by exploiting hardware is discussed in section 6. Section 7 discusses related work.

2. The Fine Print

To be able to program a processor, one typically consults a specification that defines its behavior. The specification may be formal, but it is more likely to be a manual written in some natural language. The specification explains the architecture of the processor, the effects of executing each instruction, the way instructions and data are represented, how the console switches affect execution, etc. Invariably, the specification does not explain what behavior is possible or likely by a faulty processor. This omission is problematic when fault tolerance is desired, since it deprives the programmer of information necessary for designing software that can detect and cope with failures.




A processor failure occurs when the processor no longer satisfies its specification. Behavior in response to a failure can be classified according to the nature of the disruption it causes.

Byzantine Failures. A processor can exhibit arbitrary and malicious behavior, perhaps involving collusion among other faulty processors [16].

Crash Failures. A processor halts in response to a
failure  p]. 

Fail-stop Failures. A processor halts in response to a failure; other processors can detect that the failure has occurred [28].

Note that these categories of failures can be viewed as defining processor abstractions. For example, a Fail-stop Processor is one that is restricted to fail-stop failures.

Byzantine failures are the most disruptive. Crash failures are less disruptive because processors never perform erroneous actions: every message sent and state transformation performed is consistent with the program being executed. However, unless processor execution speeds are known or processor clocks are approximately synchronized, it is not possible to distinguish between a processor that is executing very slowly and one that has halted due to a crash failure. Yet, the ability to make this distinction can be important [9]. A processor that has crashed can take no further action, but a processor that is merely slow can. Other processors can safely perform actions on behalf of a crashed, hence halted, processor, but not on behalf of a slow one, because subsequent actions by the slow processor might not be consistent with actions performed on its behalf by others. Fail-stop failures are the easiest to cope with because processor failures can be detected by other processors; other processors can safely perform actions on behalf of a failed processor. And, in a system where processors have approximately synchronized clocks and message delivery delays are bounded, crash failures can be converted into fail-stop failures by using timeouts.
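The timeout conversion can be illustrated with a minimal Python sketch. It assumes each processor emits a periodic heartbeat; the class name TimeoutDetector, the heartbeat period, and the single bound delta (covering message delay plus clock skew) are hypothetical names introduced here for illustration, not part of the paper.

```python
from dataclasses import dataclass, field


@dataclass
class TimeoutDetector:
    """Declare a processor failed once its heartbeat is overdue by more than the bound."""
    period: float  # assumed interval between heartbeats
    delta: float   # assumed bound on message delay plus clock skew
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, proc: str, now: float) -> None:
        # Record the local clock reading at which a heartbeat from proc arrived.
        self.last_seen[proc] = now

    def has_failed(self, proc: str, now: float) -> bool:
        # A non-faulty processor's next heartbeat must arrive within period + delta,
        # so silence beyond that bound can only mean the processor has halted.
        return now - self.last_seen[proc] > self.period + self.delta


detector = TimeoutDetector(period=1.0, delta=0.5)
detector.heartbeat("p", now=0.0)
```

The test is only sound when the bound really holds: if delays can exceed delta, a slow processor would be misclassified as crashed.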

A system that can tolerate Byzantine failures can tolerate anything. Since an application that makes assumptions about possible behavior of faulty processors runs the risk of failing if these assumptions are not satisfied, it is prudent that life-critical control systems tolerate Byzantine failures. However, while there is anecdotal evidence that Byzantine failures do occur, there is no published data concerning the likelihood of such failures. For most applications, it suffices to tolerate crash failures and, where necessary, convert them into fail-stop failures.


3. Replication

Failures can be partitioned into two classes, depending on whether repair is required following a failure. A component can exhibit a single hard failure but multiple transient failures between repairs. The occurrence of a hard failure influences the future operation of the device, while the occurrence of a transient failure does not. Byzantine failures are hard because the failure might destroy state information that would render subsequent operation meaningless; crash and fail-stop failures are hard because the device halts and subsequent operation is impossible. Communications lines often exhibit transient failures in response to a noise burst: a message that is in transit might be corrupted, but subsequent transmissions will succeed.


Failures, be they hard or transient, can be detected only by replicating actions in failure-independent ways. One way to do this is by performing the action using components that are physically and electrically isolated; we call this replication in space. The validity of the approach follows from an empirically justified belief in the independence of failures at physically and electrically isolated devices. A second approach to replication is for a single device to repeatedly perform the action. We call this replication in time. Replication in time is valid only for transient failures.

If the results of performing a set of replicated actions disagree, a failure has occurred. Without making further assumptions, this is the strongest statement that can be made. In particular, if the results agree, we cannot assert that no failure has occurred and the results are correct. This is because if there are enough failures, all of the replicas might be corrupted, yet still agree.

Observe that (t+1)-fold replication permits failure detection but not failure masking when there are as many as t failures. When there is disagreement among t+1 independently obtained results, one cannot assume that the majority value is correct. Masking failures requires (2t+1)-fold replication, since then as many as t values can be faulty without causing the majority value to be faulty.
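The difference between detecting and masking can be made concrete with a small Python sketch. The function names detect_failure and mask_failures are hypothetical, and results are assumed to be hashable values returned by independent replicas.

```python
from collections import Counter


def detect_failure(results):
    """With t+1 replicas and at most t failures, any disagreement proves a failure."""
    return len(set(results)) > 1


def mask_failures(results, t):
    """With 2t+1 replicas and at most t failures, a value seen t+1 times is correct."""
    if len(results) < 2 * t + 1:
        raise ValueError("masking t failures needs at least 2t+1 results")
    value, count = Counter(results).most_common(1)[0]
    if count < t + 1:
        # No value is backed by t+1 replicas, so the bound of t failures was exceeded.
        raise RuntimeError("more than t failures occurred")
    return value
```

For example, with t = 1 the results [7, 9] reveal that something failed but not which value is right, whereas [7, 9, 7] lets the caller proceed with 7.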

Requirements for fault-tolerance are usually stated in terms of MTBF (mean-time-between-failures), probability of failure over a given interval, and other statistical measures [33]. While it is clear that such characterizations are important to users of a system, there are advantages to describing the fault tolerance of a system in terms of the maximum number of failures that can be tolerated over some interval of interest. We shall say that a system is t-fault tolerant if that system will continue to operate correctly provided t or fewer failures occur.¹ Asserting that a system is t-fault tolerant is a guarantee. This guarantee is independent of the reliability of the components that make up the system and therefore is a measure of the fault tolerance supported by the system architecture, in contrast to fault tolerance achieved simply by using reliable components. Fault tolerance of an actual system will depend on the reliability of the components used in constructing the system, in particular, the probability that there will be t or more failures during the operating interval of interest. In practice, t is chosen based on statistical measures of component reliability. Once t has been chosen, it is a simple matter to derive the usual statistical measures of reliability by computing the probabilities of various configurations of 0 through t failures and their consequences [2].
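As a rough illustration of that last computation, the sketch below sums the probabilities of 0 through t failures. It assumes n components that fail independently, each with the same probability p over the interval of interest; these assumptions and the function name are introduced here and are not taken from the paper or from [2].

```python
from math import comb


def prob_at_most_t_failures(n: int, t: int, p: float) -> float:
    """Probability that at most t of n independent components fail (binomial model)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(t + 1))


# Three replicas, tolerating one failure, each failing with probability 0.1.
reliability = prob_at_most_t_failures(n=3, t=1, p=0.1)
```

Under this model the 1-fault tolerant, triply replicated system operates correctly with probability 0.729 + 0.243 = 0.972, compared with 0.9 for a single component.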

4. State Machine Approach

One way to employ replication in space for a function is to place a copy of a program that implements that function on each of the processors in a distributed system. Provided the program is deterministic and each copy of the program receives the same requests in the same order, it will do the same thing, hence produce the same results. The results of these copies can then be compared. If t+1 or more of the results are the same, this result can be used as the output of a t-fault-tolerant implementation of the function. This technique, known as the state machine approach, was first described in [12]. It was generalized to handle fail-stop failures in [27], a class of failures between crash and Byzantine failures in [13], and full Byzantine failures in [14].

¹A t-fault tolerant system might continue to operate correctly if more than t failures occur, but we cannot guarantee correct operation.

The name "state machine" comes from viewing a program as an automaton that repeatedly reads input (requests), performs a computation, and generates results. Not only is this view of programs simple, it is general. For example, process control applications are usually formulated in terms of one or more loops:

    do true → read from sensors
              compute new state and output values
              write to actuators
    od

As another example, a memory cell can be viewed as a hardware implementation of a state machine. Two types of requests are serviced: read and write. In response to a read, the current value of the cell is fetched and returned as the output of the state machine; in response to a write, the current value of the cell is changed and an acknowledgment message returned.
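The memory cell example can be sketched in Python as follows. The class name MemoryCell, the tuple encoding of requests, and the "ack" reply are illustrative assumptions; the point is only that deterministic copies fed the same requests in the same order produce identical outputs.

```python
class MemoryCell:
    """A deterministic state machine that services read and write requests."""

    def __init__(self, value=None):
        self.value = value

    def apply(self, request):
        op = request[0]
        if op == "read":
            return self.value          # output is the current value of the cell
        if op == "write":
            self.value = request[1]    # change the cell, then acknowledge
            return "ack"
        raise ValueError(f"unknown request {op!r}")


# Three copies receive the same requests in the same order.
requests = [("write", 5), ("read",), ("write", 8), ("read",)]
copies = [MemoryCell() for _ in range(3)]
outputs = [[copy.apply(r) for r in requests] for copy in copies]
```

Every list in outputs is the same, which is what makes comparing or voting on replica results meaningful.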

The state machine approach is based on two abstractions:

Agreement. Every non-faulty copy of the state machine receives every request.

Order. Requests are processed in the same order by every non-faulty copy of the state machine.

Notice that among the (irrelevant) details suppressed by these abstractions are the behavior of faulty processors and the properties of the communications facility. In particular, the Agreement abstraction masks the effects of processor failures on the dissemination of values, yet stipulates nothing about the behavior of faulty processors; the Order abstraction synchronizes message receipt, yet stipulates nothing about message delivery order or system synchronicity. Implementing these abstractions will, of course, require attention to these details.

Abstractions are of value only if they can be implemented, so we now consider implementing Agreement and Order in a distributed system where processors can exhibit Byzantine failures. We assume communications lines that are only subject to transient failures due to noise bursts and that the network has sufficient connectivity so that despite hard failures every processor can communicate with every other.² We also assume that speed differences between (non-faulty) processors and message delivery delays have a known bound, which in practice is a reasonable assumption, and that clocks on non-faulty processors are approximately synchronized, which is not a reasonable assumption but one that can be discharged using any of the Byzantine clock synchronization protocols that have been devised
IL<][19][20]. 

Before considering implementations for the Agreement and Order abstractions, we describe an interprocessor communications service that, provided t or fewer failures occur, satisfies properties VC1 and VC2 of a virtual circuit (see section 1), as well as the following authentication property.

Auth: A process can sign a message it receives and then forward it to any other process. Any process that receives such a signed message can determine whether the message has been corrupted since it was signed.

²There is no way to coordinate processors if they cannot communicate.

Property VC1 can be satisfied in our system by resending each message until t+1 identical copies have been received. (To tolerate at most t failures, a maximum of 2t+1 copies of a message will have to be sent.) This is a form of replication in time and is appropriate because of the assumptions made above about the communications network. The traditional use of check-sums to trigger retransmission of corrupted messages is actually an optimization of this approach that assumes a failure causing corruption of a message always causes inconsistency between a message and its check-sum. The check-sum replaces having agreement on t+1 copies of the message; the request for retransmission causes additional copies of the message to be sent only when necessary and only until a copy is received that is (apparently) uncorrupted.
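The receiver's side of the resending scheme for VC1 might look like the following Python sketch. The function name receive_until_agreed is hypothetical, and the stream of copies is modeled as a plain list of byte strings, some of which may have been corrupted by transient failures.

```python
from collections import Counter


def receive_until_agreed(copies, t):
    """Consume copies of one message until some payload has arrived t+1 times."""
    seen = Counter()
    for used, payload in enumerate(copies, start=1):
        seen[payload] += 1
        if seen[payload] == t + 1:
            # At most t copies can be corrupted, so t+1 identical copies are genuine.
            return payload, used
    raise RuntimeError("sender stopped before t+1 identical copies arrived")


# One transient failure (t = 1) corrupts the second of three copies.
payload, copies_used = receive_until_agreed([b"hello", b"hxllo", b"hello"], t=1)
```

Here all 2t+1 = 3 copies are needed, which matches the worst case stated above.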

Property VC2 can be achieved by the sender including sequence numbers in messages and having the receiver buffer messages and deliver them in strict order by sequence number.
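A minimal Python sketch of such a receiver is given below. The class name OrderedReceiver and the choice to start sequence numbers at 0 are assumptions made for illustration.

```python
class OrderedReceiver:
    """Buffer out-of-order messages and deliver them strictly by sequence number."""

    def __init__(self):
        self.next_seq = 0    # sequence number of the next message to deliver
        self.buffer = {}     # messages that arrived early, keyed by sequence number
        self.delivered = []  # messages handed to the application, in order

    def receive(self, seq, payload):
        if seq < self.next_seq:
            return           # duplicate of a message that was already delivered
        self.buffer[seq] = payload
        # Deliver as long as the next expected message is available.
        while self.next_seq in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1


receiver = OrderedReceiver()
for seq, payload in [(1, "b"), (0, "a"), (0, "a"), (2, "c")]:
    receiver.receive(seq, payload)
```

Message 1 arrives first but is held until message 0 has been delivered, and the duplicate of message 0 is ignored.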

Property Auth can be approximated by using check-sums. This approximation works because failures typically result in random, rather than truly malicious behavior. Carefully designed redundancy is usually sufficient to cope with random malicious behavior. Digital signatures [25], which are somewhat more costly than check-sums, can be used if intelligent malicious behavior is anticipated.
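The check-sum approximation of Auth can be sketched with Python's standard zlib.crc32. The function names sign and verify are hypothetical, and the choice of CRC-32 is an assumption made here. The sketch guards only against random corruption: anyone can recompute a CRC, so it offers no protection against intelligent malicious behavior, for which a real digital signature scheme would be needed.

```python
import zlib


def sign(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 check-sum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify(signed: bytes) -> bool:
    """Check that the trailing check-sum still matches the payload."""
    payload, check = signed[:-4], signed[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == check


good = sign(b"open valve")
# A relay alters the contents but keeps the original check-sum.
tampered = b"shut valve" + good[-4:]
```

A recipient calling verify accepts good and rejects tampered, which is the detection that Auth requires when corruption is random.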

4.1. Agreement Abstraction

The Agreement abstraction can be realized by devising a protocol that allows a designated processor, called the transmitter, to disseminate a value to other processors in such a way that

IC1: All non-faulty processors agree on the same value.

IC2: If the transmitter is non-faulty then all non-faulty processors use its value as the one they agree on.

The hard part in implementing such a protocol is being able to cope with a transmitter exhibiting Byzantine failures. Such a transmitter might send different values to different processors in an attempt to violate IC1. Clearly, processors must exchange the values they receive from the transmitter among themselves before agreeing on any value.

To ensure IC1, it is sufficient that when the protocol terminates, the set of values received by each non-faulty processor is the same, because then each processor can compute some function on the contents of this set and obtain a value on which all will agree. However, the details of ensuring this and that IC2 also holds are subtle.

One problem is that faulty processors might selectively fail to relay values. Then, a processor might be delayed forever awaiting a value. To handle this difficulty, the assumptions made above about clock speeds and message delivery delays are exploited. Approximately synchronized clocks and bounded message delivery delays allow a processor to determine when it can expect no further messages from non-faulty processors. Thus, no (faulty) processor can cause another to be delayed indefinitely awaiting a message that will never arrive.

A second problem is that a faulty processor might relay different values to different processors. However, use of signed messages prevents a (faulty) processor p from forging a message. And, if p receives a message m from q, then an attempt by p to change the contents of m before relaying it will not succeed: q will have signed the message and so p's tampering with its contents will invalidate q's signature, which would be detectable by any recipient of the message due to Auth. Thus, using signatures ensures that every value received by a processor is either incorrectly signed and detected as such, or one of the values originally sent by the transmitter.

The remaining problem is to ensure that not only are the values received by each non-faulty processor a subset of the values sent by the transmitter, but that all non-faulty processors agree on the contents of this set. This is solved by having processors sign and relay the values they originally receive from the transmitter. If at most t processors can be faulty, then t+1 rounds of message relay are necessary and sufficient [7]. To see that t+1 rounds are sufficient, note that if the transmitter is faulty then t rounds ensure that every message is seen by at least one non-faulty processor, and a non-faulty processor will follow the protocol and therefore forward the message to all other processors; if the transmitter is non-faulty, then its value is the only one that will ever be accepted by a non-faulty processor due to the requirements about signatures.

Putting all this together, we get the following protocol for the Agreement abstraction, assuming there are no more than t faulty processors in the system. A process making a request of the state machine is the transmitter; the copies of the state machine are the other processors.

Agreement Protocol:

The transmitter signs and sends a copy of its value to every processor.

Every other processor p performs t+1 rounds, as follows.³ Whenever p receives a message in round i with i signatures, indicating that the message has been relayed through i processes without being modified, it adds the value in that message to V_p, the set of values it has received, appends its signature to the message, and relays the signed message to all processors that have not already signed it.

At the end of round t+1, to select the agreed upon value, each processor computes the same given function on V_p.
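The protocol can be exercised with a small round-based simulation in Python. This is a sketch under strong simplifying assumptions, none of which come from the paper: signatures are modeled as a tuple of signer ids that cannot be forged, the transmitter is processor 0, faulty behavior is limited to a transmitter that sends different values and to relays that stay silent, and the agreed value is chosen as the minimum of V_p. The names agree, transmitter_sends, and silent are hypothetical.

```python
def agree(n, t, transmitter_sends, silent=frozenset()):
    """Simulate t+1 relay rounds among processors 1..n-1; processor 0 transmits.

    transmitter_sends maps a receiver id to the value the transmitter signs and
    sends it (a faulty transmitter may send different values to different ids).
    silent is a set of faulty relays that never forward anything.
    Returns {processor: agreed value} for the processors not in silent.
    """
    V = {p: set() for p in range(1, n)}
    # Round 1 inboxes hold messages carrying one signature, the transmitter's.
    inbox = {p: [(v, (0,))] for p, v in transmitter_sends.items()}
    for rnd in range(1, t + 2):
        outbox = {p: [] for p in range(1, n)}
        for p, msgs in inbox.items():
            if p in silent:
                continue
            for value, signers in msgs:
                if len(signers) != rnd or p in signers:
                    continue  # wrong number of signatures for this round
                V[p].add(value)
                relayed = (value, signers + (p,))  # append p's signature
                for q in range(1, n):
                    if q not in relayed[1]:        # skip those who already signed
                        outbox[q].append(relayed)
        inbox = outbox
    # Every processor applies the same function to its set of values.
    return {p: min(V[p], default=None) for p in V if p not in silent}


# A faulty transmitter tells processor 2 something different (n = 4, t = 1).
decisions = agree(n=4, t=1, transmitter_sends={1: "a", 2: "b", 3: "a"})
```

After two rounds every processor holds the set {"a", "b"}, so all of them select the same value even though the transmitter equivocated.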

At first, the cost of this protocol, t+1 rounds of message exchange, might appear prohibitive. In fact, there exist protocols to establish IC1 and IC2 where the number of rounds required is proportional to the number of failures that actually occur during execution of the protocol [35]. When there are few failures, these protocols are comparable in cost to 2- and 3-phase commit protocols [10][17][29]. However, they are considerably more fault tolerant: classic commit protocols are unable to tolerate Byzantine failures. There is an extensive literature concerning implementing the Agreement abstraction, which is variously called the Byzantine Generals Problem and the Consensus Problem. See [8] and [32].


³Since the clocks on non-faulty processors are approximately synchronized and message delivery time is bounded, each processor can independently determine when each round starts and finishes.


4.2. Order Abstraction

Our next task is to support the Order abstraction. For this, we employ timestamps. Every request disseminated to the ensemble of state machine copies is given a timestamp using the clock on the processor making the request. Assuming that these clocks have sufficient resolution to ensure that two requests from the same user of the state machine are assigned different timestamps, the timestamps will define a fixed order on requests. (Two requests with the same timestamp from different users are ordered according to the user names, which are assumed to be unique.)

It is not sufficient, however, for state machine copies simply to process requests that have been received in ascending order by timestamp. We must ensure that a copy of the state machine processes a request only if no request with a smaller timestamp can be subsequently received. We shall say that a request is stable at p once no request with lower timestamp can be delivered to the state machine copy running on processor p. Clearly, a request should be processed only after it becomes stable.

Testing stability of a request can be accomplished by exploiting the bounds on delivery delays and processor clocks. If requests are disseminated using an agreement protocol like the one above, then there will exist a constant Δ such that a message sent at time τ by transmitter p will be received by τ+Δ at every other processor according to each processor's local clock. (See [6] for a detailed derivation for Δ in a variety of environments.) Once the clock on a processor p reads time τ, p cannot subsequently receive a request with timestamp less than τ−Δ. Therefore, a stability test for use with processors that can exhibit arbitrary behavior when they fail is:

Order  Protocol: 

Stable requests are processed in ascending order by timestamp. A request is stable at p if the timestamp on the request is T and the clock at p has a value greater than T+Δ.
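The stability test lends itself to a short Python sketch built on a priority queue. The class name OrderedQueue and its method names are hypothetical; delta plays the role of the bound Δ, and ties between equal timestamps are broken by user name, as described above.

```python
import heapq


class OrderedQueue:
    """Hold delivered requests and release them only once they are stable."""

    def __init__(self, delta):
        self.delta = delta
        self.pending = []  # min-heap of (timestamp, user, request)

    def deliver(self, timestamp, user, request):
        heapq.heappush(self.pending, (timestamp, user, request))

    def take_stable(self, clock):
        """Pop, in ascending order, every request whose timestamp T has clock > T + delta."""
        ready = []
        while self.pending and clock > self.pending[0][0] + self.delta:
            ready.append(heapq.heappop(self.pending))
        return ready


queue = OrderedQueue(delta=5)
queue.deliver(10, "carol", "r1")
queue.deliver(14, "bob", "w2")
queue.deliver(10, "alice", "w1")
first = queue.take_stable(clock=16)   # only timestamp 10 satisfies 16 > T + 5
later = queue.take_stable(clock=20)   # now timestamp 14 satisfies 20 > T + 5
```

Every copy that runs this test against its own approximately synchronized clock processes the same requests in the same order, at the price of delaying each request by at least delta.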

5. Checkpoints and Restarts

A second general approach for achieving fault tolerance is periodically to save state information as checkpoints, and restart the computation from its last checkpoint when a failure occurs.⁴ This approach requires that two assumptions hold.

(1) Failures are detected before the results of invalid state transformations are saved in a checkpoint.

(2) Checkpoints are unaffected by failures and remain available.

These assumptions permit a computation to be restarted after a failure from one of its prior states that was saved as a checkpoint.
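The checkpoint and restart cycle can be sketched in Python as below. The function name run_with_checkpoints, the every parameter (steps between checkpoints), and the fail_before parameter (which simulates one detected failure) are illustrative assumptions. A local variable stands in for stable storage, and states are assumed to be immutable values so that a saved checkpoint cannot be altered later.

```python
def run_with_checkpoints(steps, initial, every=2, fail_before=None):
    """Run deterministic steps, checkpointing periodically; restart on a detected failure."""
    stable = (0, initial)      # stands in for stable storage: (next step, state)
    i, state = stable
    restarts = 0
    while i < len(steps):
        if i == fail_before:   # failure detected before step i takes effect
            fail_before = None # the failure is simulated only once
            i, state = stable  # volatile state is lost; resume from the checkpoint
            restarts += 1
            continue
        state = steps[i](state)
        i += 1
        if i % every == 0:
            stable = (i, state)  # only states reached without failure are saved
    return state, restarts


steps = [lambda s: s + 1, lambda s: s * 2, lambda s: s + 3, lambda s: s * 4]
clean = run_with_checkpoints(steps, 1)
recovered = run_with_checkpoints(steps, 1, fail_before=3)
```

The run with a failure redoes the step after the last checkpoint and ends in the same final state as the failure-free run, which is what the two assumptions are meant to guarantee.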

As before, we discharge these assumptions by postulating abstractions. We assume storage is partitioned into volatile storage and stable storage [17] such that:

Failure Detection. Failures are detected before any visible erroneous state transformation is performed.

Stable Storage. Information stored in stable storage remains available, despite failures.

⁴A variety of language constructs have been defined to support this style of programming fault-tolerant applications. See [24] and [26].




Note that implementing the Failure Detection abstraction is equivalent to implementing the Fail-stop Processor abstraction of section 2.

Replication in space can be used to implement the Failure Detection abstraction in a system of processors that can exhibit Byzantine failures and have bounded speed differences and message delivery delays. In order to tolerate t failures, every computation is performed on t+1 processors and the results of each step are compared. Whenever a disagreement occurs, a failure has been detected. The details of ensuring that all copies receive the same input in the same order are the same as described above for the state machine approach: we employ our Agreement and Order abstractions. However, for Failure Detection, only (t+1)-fold replication is required, while for a replicated state machine, (2t+1)-fold replication was required. A state machine that is replicated (2t+1)-fold can mask failures, though; our Failure Detection implementation cannot.

The Stable Storage abstraction can also be implemented in such a system by using replication in space. Here, however, data must be replicated (2t+1)-fold since data must remain accessible even if as many as t failures occur. Again, the state machine approach is used. The state machine responds to read and write requests in the obvious way. The difference between implementing Stable Storage using a state machine and replicating the entire computation (2t+1)-fold is one of cost. If accesses to Stable Storage are infrequent, the processing power allocated to state machine instances running the Stable Storage abstraction can be modest, in contrast to what might be required to run the entire computation replicated 2t+1 times.

If processors exhibit only crash failures or fail-stop failures, then the implementation of the Failure Detection abstraction remains unchanged, but the Stable Storage abstraction can be implemented with (t+1)-fold replication.

6.  Exploiting  Hardware 

The implementations described above are quite conservative. All assumptions necessary for correctness are explicit, and we generally made no assumptions about behavior of faulty processors. The implementations are also expensive, perhaps too expensive for all but the most critical applications. Other implementations do exist, although we avoided them because invariably they involve stronger assumptions.

Exploiting hardware is a cost-effective way to support our abstractions in practice. For example, Agreement and Order might be implemented by using a triple-redundant bus, Failure Detection (or, Fail-stop Processors) by hardware monitoring of key points in the processor [11], and Stable Storage by using mirrored (dual) disks. Such hardware implementations are based on engineering data about how components fail. While it is usually impossible to guarantee that these implementations will continue to work in the presence of (any) t failures, it is possible to assert that they will work provided certain other assumptions are valid about the number and nature of failures during some interval of interest. Thus, t-fault tolerance is really a special case of what we might call A-fault tolerance, which stipulates that an implementation satisfies its specification provided some set of assumptions A remains valid. Engineering data about component failures allows a set of assumptions A to be chosen.

Computer designers frequently design an architecture
and then investigate realizations of that architecture with
different price/performance ratios. Our abstractions for
fault tolerance give the architect of a fault-tolerant
application the opportunity to consider implementations with
different price/fault-tolerance ratios. When designing in
terms of our abstractions, at least in theory, one need
not decide between hardware and software
implementations until the entire design is complete.
Moreover, the architect can contemplate A-fault tolerant
implementations for different assumptions A, thereby permitting
additional tradeoffs between price and fault-tolerance.
Finally, the abstractions permit a separation of concerns
and a portability of design not otherwise possible.

By no means is there widespread agreement among
researchers or practitioners on what are the right
abstractions for distributed systems, fault-tolerant systems, or
distributed fault-tolerant systems. This paper describes
abstractions that we have found useful. Although the
particular abstractions were motivated by specific approaches
to achieving fault-tolerance, they are quite general. For
example, the Agreement and Order abstractions underlie
most distributed synchronization mechanisms, even when
fault tolerance is not a concern [27]. Failure Detection
has application beyond the checkpoint/restart approach
described in section 5; the literature is replete with
algorithms that require such a service.

The most surprising thing about our abstractions is
their form. Abstractions common in sequential and
concurrent programming relate state information and
operations to manipulate that state. Such abstractions can
usually be presented to the programmer as language constructs
(e.g. the monitor construct or coroutine control-regime) or
as a module instance (e.g. a stack or tree). Our
abstractions are best presented to the programmer as
guarantees: assertions that can be used to simplify a
program by eliminating possible executions. The ISIS project
[4] is currently investigating this approach to fault-tolerant
distributed systems by providing kernel-level support for
Agreement, Ordering, and Failure Detection.

Other approaches to designing distributed
(fault-tolerant) systems that are being discussed extend the
semantics of instructions and/or data to help the
programmer cope with distribution and fault-tolerance. Argus [18],
CLOUDS [1], an early version of ISIS [3], LOCUS [22],
and TABS [31] all provide transactions as an abstraction
for structuring fault-tolerant applications. A transaction is
like an instruction, except that its execution is atomic.
Intermediate states of a transaction are never visible, even
if a failure occurs during its execution. Note that a
transaction that aborts due to a failure can be re-executed,
since the state will be unchanged from what existed when
the first execution of the transaction was attempted. Thus,
a sequence of transactions specifies a computation not
unlike the checkpoint/restart approach described above. In
addition to sequential composition, transactions can be
composed hierarchically, resulting in nested transactions
[21]. By making the grain of computation small (a
consequence of nesting transactions) we can decrease the
interval that must elapse between failures in order to make
progress and therefore increase the probability that progress
can be made. Notice that implicit in the notion of a
transaction is some form of Stable Storage.
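
The abort-and-re-execute behavior of a transaction can be
sketched in a few lines. This is an illustration under stated
assumptions rather than the mechanism of any of the systems
cited: TransientFailure and run_transaction are hypothetical
names, and an in-memory dictionary stands in for state that a
real system would keep in Stable Storage.

```python
import copy
from typing import Any, Callable, Dict

State = Dict[str, Any]


class TransientFailure(Exception):
    """Hypothetical signal that a failure interrupted a transaction."""


def run_transaction(state: State,
                    body: Callable[[State], None],
                    max_attempts: int = 3) -> int:
    """Run body atomically on state; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        # The body works on a private copy, so intermediate
        # states are never visible through `state`.
        working = copy.deepcopy(state)
        try:
            body(working)
        except TransientFailure:
            # Abort: `state` is unchanged, so re-execution is safe.
            continue
        # Commit. A real system would make this step atomic
        # by means of Stable Storage; here it is a plain update.
        state.clear()
        state.update(working)
        return attempt
    raise RuntimeError("transaction did not commit")
```

A body that fails partway through its first execution leaves
state untouched, and the second execution starts from the
same values the first one saw.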

The use of remote procedure calls [23] provides the
basis for another approach to designing fault-tolerant
distributed systems. A remote procedure call is like an
ordinary procedure call, except the procedure body can be
executed by a different processor from the one executing the
caller. This hides distribution from the programmer.
When done properly, it can also hide certain types of
failures. Implementing remote procedures so that the
procedure body is performed exactly once (not at most once,
or at least once) is a hard problem [23][30]. Replicated
remote procedures [5] provide a way to support replication
in space with a remote procedure call interface. Here,
executing a single call causes a copy of the procedure body to
be executed on a number of processors; the results of
executions are returned to the caller and compared.
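
The caller's side of a replicated remote procedure can be
sketched as below. The text says only that results are
returned and compared; taking the value reported by a strict
majority of replicas is an assumption made here for
illustration, as are the name replicated_call, the use of local
callables in place of remote processors, and the requirement
that results be hashable.

```python
from collections import Counter
from typing import Any, Callable, Sequence


def replicated_call(replicas: Sequence[Callable[..., Any]],
                    *args: Any) -> Any:
    """Invoke every replica with args and compare the results."""
    results = []
    for procedure in replicas:
        try:
            results.append(procedure(*args))
        except Exception:
            # A replica that fails contributes no result.
            continue
    if not results:
        raise RuntimeError("no replica returned a result")
    value, count = Counter(results).most_common(1)[0]
    # Assumed policy: accept only a strict majority of all replicas.
    if count <= len(replicas) // 2:
        raise RuntimeError("replicas disagree: no majority result")
    return value
```

With three replicas of which one returns a wrong value or
raises an exception, the two agreeing results still form a
majority and the caller sees a single answer.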

The idea of structuring systems in terms of
abstractions is not new. Identifying abstractions for fault-tolerant
and distributed systems is. Only experience in using and
implementing an abstraction will permit us to evaluate its
utility. In the meantime, it behooves the designer of a
system to think in terms of abstractions, be they new ones or
existing ones, because only then will we stop reinventing
the wheel.